[dualtor] Improve mux_simulator
#16164
Signed-off-by: Longxiang Lyu <[email protected]>
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
Copilot reviewed 2 out of 3 changed files in this pull request and generated no comments.
Files not reviewed (1)
- ansible/roles/vm_set/templates/mux-simulator.service.j2: Language not supported
Comments suppressed due to low confidence (2)
ansible/roles/vm_set/files/mux_simulator.py:955
- The variable 'default_handler' is not defined, which will raise a NameError. Define 'default_handler' before using it.
app.logger.removeHandler(default_handler)
ansible/roles/vm_set/files/mux_simulator.py:947
- The new behavior introduced in the 'setup_mux_simulator' function is not covered by tests. Add tests to cover this new behavior.
def setup_mux_simulator(http_port, vm_set, verbose):
Signed-off-by: Longxiang <[email protected]>
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
Copilot reviewed 2 out of 3 changed files in this pull request and generated 1 comment.
Files not reviewed (1)
- ansible/roles/vm_set/templates/mux-simulator.service.j2: Language not supported
Signed-off-by: Longxiang <[email protected]>
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
What is the motivation for this PR?
The dualtor nightly suffers from mux simulator timeouts, and HTTP timeout failures are consistently observed. This PR tries to improve the mux simulator performance:
- improve the all mux toggle performance.
- improve the mux simulator read/write throughput.
PR 1522 addressed the issue, but only as a temporary quick fix.
How did you do it?
- Run the mux simulator with gunicorn instead of its own built-in HTTP server.
The mux simulator runs with Flask's own built-in HTTP server. Previously, the mux simulator ran in single-threaded mode, which limited its performance and throughput, and it was observed getting stuck reading from a dead connection. PR 1522 proposed a temporary fix by running the mux simulator in threaded mode. The throughput improved with the threaded approach, but the built-in server limits the tcp listen backlog to 128, and it is designed for development/test purposes and not recommended for deployment (Flask's deployment doc). So let's run the mux simulator with gunicorn:
- better performance/throughput with a customized worker count
- increased tcp listen backlog
- Use a thread pool to parallelize the toggle requests.
The mux simulator handles the toggle-all request by toggling each mux port one by one. Let's use a thread pool to run those toggles in parallel to further decrease the response time.
How did you verify/test it?
Run the following benchmarks on a dualtor-120 testbed, and compare the performance of:
- A: the original mux simulator, with Flask built-in server in single-thread mode.
- B: the mux simulator with Flask built-in server in threaded mode.
- C: the mux simulator with this PR.
Toggle mux status for all mux ports (one request to toggle one mux port): 20 concurrent users, repeated 2000 times.
| mux simulator version | A | B | C |
| --- | --- | --- | --- |
| elapsed time | 96s | 37s | 36s |
Toggle mux status for all mux ports (one request to toggle all mux ports): 1 user, repeated 1 time.
| mux simulator version | A | B | C |
| --- | --- | --- | --- |
| elapsed time | 16s | 16s | 7s |
To summarize, mux simulator with this PR has the best performance in toggles.
Any platform specific information?
Supported testbed topology if it's a new test case?
Signed-off-by: Longxiang Lyu <[email protected]>
Cherry-pick PR to 202405: #16369
This reverts commit 8c14bdd.
This reverts commit 9f2412d.
What is the motivation for this PR?
Use a thread pool to run the mux toggles in parallel. This code is from PR #16164, which was reverted, so the change is reintroduced here.
Signed-off-by: Longxiang [email protected]
How did you do it?
As described in the motivation.
How did you verify/test it?
Run on dualtor/dualtor-120 testbed.
Signed-off-by: Longxiang <[email protected]>
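The thread-pool idea described in this commit message can be sketched with the standard library. This is only an illustration, not the code from the PR: `toggle_port`, `toggle_all`, the port names, the `"upper_tor"` value, and the `max_workers` setting are all hypothetical stand-ins, and the sleep merely mimics the latency of toggling one mux port.

```python
from concurrent.futures import ThreadPoolExecutor
import time


def toggle_port(port, new_side):
    """Hypothetical stand-in for toggling one mux port; sleeps to mimic I/O latency."""
    time.sleep(0.01)
    return port, new_side


def toggle_all(ports, new_side, max_workers=8):
    """Toggle every port concurrently and return a {port: side} mapping."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves input order; dict() consumes the results
        # before the pool is shut down.
        results = pool.map(lambda p: toggle_port(p, new_side), ports)
        return dict(results)


if __name__ == "__main__":
    ports = ["Ethernet%d" % i for i in range(0, 128, 4)]
    start = time.monotonic()
    status = toggle_all(ports, "upper_tor")
    print(len(status), "ports toggled in %.2fs" % (time.monotonic() - start))
```

Because each toggle spends most of its time waiting rather than computing, threads can overlap that waiting, which is why a pool can shorten a toggle-all request even under Python's global interpreter lock.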
Description of PR
Summary:
Fixes # (issue)
Type of change
Back port request
Approach
What is the motivation for this PR?
The dualtor nightly suffers from mux simulator timeouts, and HTTP timeout failures are consistently observed.
This PR tries to improve the mux simulator performance:
- improve the all mux toggle performance.
- improve the mux simulator read/write throughput.
PR 1522 addressed the issue, but only as a temporary quick fix.
How did you do it?
- Run the mux simulator with gunicorn instead of its own built-in HTTP server.
The mux simulator runs with Flask's own built-in HTTP server. Previously, the mux simulator ran in single-threaded mode, which limited its performance and throughput, and it was observed getting stuck reading from a dead connection. PR 1522 proposed a temporary fix by running the mux simulator in threaded mode. The throughput improved with the threaded approach, but the built-in server limits the tcp listen backlog to 128, and it is designed for development/test purposes and not recommended for deployment (Flask's deployment doc).
So let's run the mux simulator with gunicorn:
- better performance/throughput with a customized worker count
- increased tcp listen backlog
- Use a thread pool to parallelize the toggle requests.
The mux simulator handles the toggle-all request by toggling each mux port one by one. Let's use a thread pool to run those toggles in parallel to further decrease the response time.
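The gunicorn settings mentioned above (customized worker count, larger listen backlog) could be expressed in a gunicorn configuration file along these lines. This is a sketch under stated assumptions: the setting names are gunicorn's, but every value here, including the bind address and port, is illustrative and not taken from the PR.

```python
# gunicorn.conf.py -- illustrative values only, not taken from the PR
import multiprocessing

# Address and port are assumptions; the simulator's real port comes from its setup.
bind = "0.0.0.0:8080"

# Customized worker count: a small pool, capped by the number of CPUs.
workers = min(4, multiprocessing.cpu_count())

# Threads per worker; with more than one thread gunicorn uses its
# threaded ("gthread") worker instead of the plain sync worker.
threads = 8

# Size of the pending-connection queue, well above the built-in server's 128.
backlog = 2048

# Seconds a worker may stay silent before it is killed and restarted.
timeout = 30
```

Such a file would be passed with `gunicorn -c gunicorn.conf.py mux_simulator:app`, where the `mux_simulator:app` module and application name is an assumption about how the Flask app is exposed. Note that the kernel's `net.core.somaxconn` limit also caps the effective backlog on Linux.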
How did you verify/test it?
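Run the following benchmarks on a dualtor-120 testbed, and compare the performance of:
- A: the original mux simulator, with Flask built-in server in single-thread mode.
- B: the mux simulator with Flask built-in server in threaded mode.
- C: the mux simulator with this PR.
Toggle mux status for all mux ports (one request to toggle one mux port): 20 concurrent users, repeated 2000 times.
| mux simulator version | A | B | C |
| --- | --- | --- | --- |
| elapsed time | 96s | 37s | 36s |
Toggle mux status for all mux ports (one request to toggle all mux ports): 1 user, repeated 1 time.
| mux simulator version | A | B | C |
| --- | --- | --- | --- |
| elapsed time | 16s | 16s | 7s |
To summarize, mux simulator with this PR has the best performance in toggles.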
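A load test of the kind described above could be driven by a small harness such as the following. It is a hedged sketch, not the benchmark actually used: `run_benchmark` and `fake_request` are hypothetical names, the request is a sleep rather than a real HTTP call to the simulator, and reading "20 concurrent users, repeated 2000 times" as 2000 total requests shared by 20 workers is an assumption.

```python
from concurrent.futures import ThreadPoolExecutor
import time


def run_benchmark(send_request, users=20, total_requests=2000):
    """Issue total_requests calls from `users` concurrent workers; return elapsed seconds."""
    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=users) as pool:
        # list() waits for every call and re-raises the first failure, if any.
        list(pool.map(send_request, range(total_requests)))
    return time.monotonic() - start


def fake_request(i):
    """Placeholder for one HTTP request to the simulator; sleeps to mimic latency."""
    time.sleep(0.001)


if __name__ == "__main__":
    elapsed = run_benchmark(fake_request)
    print("elapsed time: %.1fs" % elapsed)
```

Replacing `fake_request` with a function that posts a toggle to a real simulator endpoint would turn this into a rough equivalent of the first benchmark; the numbers it produces depend on the testbed and are not comparable to the table above.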
Any platform specific information?
Supported testbed topology if it's a new test case?
Documentation